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1. Introduction. We thank all the discussants for their careful reading 
of our paper, and for their insightful critiques. We would also like to thank 
the editors for organizing this discussion. Our paper contributes to the area 
of high-dimensional statistics which has received much attention over the 
past several years across the statistics, machine learning and signal process- 
ing communities. In this rejoinder we clarify and comment on some of the 
points raised in the discussions. Finally, we also remark on some interesting 
challenges that lie ahead in latent variable modeling. 

Briefly, we considered the problem of latent variable graphical model se- 
lection in the Gaussian setting. Specifically, let X be a zero-mean Gaussian 
random vector in W +h with O and H representing disjoint subsets of indices 
in {1, . . . ,p + h} with \0\ = p and \H\ = h. Here the subvector Xo represents 
the observed variables and the subvector Xh represents the latent variables. 
Given samples of only the variables Xo, is it possible to consistently per- 
form model selection? We noted that if the number of latent variables h is 
small relative to p and if the conditional statistics of the observed variables 
Xo conditioned on the latent variables Xh are given by a sparse graphi- 
cal model, then the marginal concentration matrix of the observed variables 
Xo is given as the sum of a sparse matrix and a low-rank matrix. As a first 
step we investigated the identifiability of latent variable Gaussian graphical 
models — effectively, this question boils down to one of uniquely decompos- 
ing the sum of a sparse matrix and a low-rank matrix into the individual 
components. By studying the geometric properties of the algebraic varieties 
of sparse and low-rank matrices, we provided natural sufficient conditions 
for identifiability and gave statistical interpretations of these conditions. Sec- 
ond, we proposed the following regularized maximum-likelihood estimator to 
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decompose the concentration matrix into sparse and low-rank components: 
(5„, L n ) = arg min -£(S - L; Eg) + A n ( 7 ||5||i + tr(L)) 

(1.1) 

s.t. S - LyO,LyO. 

Here Eq represents the sample covariance formed from n samples of the 
observed variables, t is the Gaussian log-likelihood function, S n represents 
the estimate of the conditional graphical model of the observed variables 
conditioned on the latent variables, and L n represents the extra correla- 
tions induced due to marginalization over the latent variables. The l\ norm 
penalty induces sparsity in S n and the trace norm penalty induces low-rank 
structure in L n . An important feature of this estimator is that it is given 
by a convex program that can be solved efficiently. Our final contribution 
was to establish the high-dimensional consistency of this estimator under 
suitable assumptions on the Fisher information underlying the true model 
(in the same spirit as irrepresentability conditions for sparse model selection 
[11, 16]). 

2. Alternative estimators. A number of the commentaries described al- 
ternative formulations for estimators in the latent variable setting. 

2.1. EM-based methods. The discussions by Yuan and by Lauritzen and 
Meinshausen describe an EM-based alternative in which the rank of the 
matrix L is explicitly constrained: 

(S n ,L n ) = arg min -£(S - L; E&) + A n ||5||i 

(2.1) 

s.t. S - L y 0, L y 0, rank(L) < r. 

The experimental results based on this approach seem quite promising, and 
certainly deserve further investigation. On the one hand, we should reiterate 
that the principal motivation for our convex optimization based formulation 
was to develop a method for latent variable modeling with provable statisti- 
cal and computational guarantees. One of the main drawbacks of EM-based 
methods is the existence of local optima in the associated variational for- 
mulations, thus leading to potentially different solutions depending on the 
initial point. On the other hand, one of the reasons for the positive em- 
pirical behavior observed by Yuan and by Lauritzen and Meinshausen may 
be that all the local optima in the experimental settings considered by the 
authors may be "good" models. Such behavior has in fact been rigorously 
characterized recently for certain nonconvex estimators in some missing data 
problems [7]. 

One of the motivations for the EM proposal of Yuan and of Lauritzen and 
Meinshausen seems to be that there are fairly mature and efficient solvers for 
the graphical lasso. As our estimator is relatively newer and as its properties 
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are better understood going forward, we expect that more efficient solvers 
will be developed for (f.f) as well. Indeed, the LogdetPPA solver [15] that 
we cite in our paper already scales to instances involving several hundred 
variables, while more recent efforts [8] have resulted in algorithms that scale 
to instances with several thousand variables. 

2.2. Thresholding estimators. Ren and Zhou propose and analyze an in- 
teresting thresholding based estimator for decomposing a concentration ma- 
trix into sparse and low-rank components. They apply a two-step procedure — 
l\ norm thresholding followed by trace norm thresholding — to obtain the 
sparse component followed by the low-rank component. Roughly speaking, 
this two-step estimator can be viewed as the application of the first cycle of 
a block coordinate descent procedure to compute our estimator that alter- 
nately updates the sparse and low-rank pieces (we also refer the reader to 
the remarks in [1]). 

However, in Theorem 1 in the discussion by Ren and Zhou, a quite strin- 
gent assumption requires that in some scaling regimes the true low-rank 



component L* must vanish, that is, \\L* < y -^p — > 0. The reason for this 
condition is effectively to ensure sign consistency in recovering the sparse 
component. In a pure sparse model selection problem (with no low-rank com- 
ponent in the population), the deviation away from the sparse component 
is given only by noise due to finite samples and this deviation is on the or- 



der of J -^p in the Gaussian setting — consequently, sparse model selection 
via t\ norm thresholding is sign-consistent when the minimum magnitude 



nonzero entry in the true model is larger than y -^p. In contrast, if the 
true model consists of both a sparse component and a low-rank component, 
the total deviation away from the sparse component in the finite sample 
regime is given by both sample noise as well as the low-rank component. 
This seems to be the reason for the stringent assumption on the vanishing 
of the low-rank component in Theorem 1 of Ren and Zhou. 

More broadly, one of the motivations of Ren and Zhou in proposing and 
analyzing their estimator is that it may be possible to weaken the assump- 
tions on the minimum magnitude nonzero entry 6 of the true sparse com- 
ponent S* and the minimum nonzero singular value a of the true low-rank 
component L* — whether this is possible under less stringent assumptions on 
L* is an interesting question, and we comment on this point in Section 3 in 
the more general context of potentially improving the rates in our paper. 

2.3. Other proposals. Giraud and Tsybakov propose two alternative es- 
timators for decomposing a concentration matrix into sparse and low-rank 
components. While our approach (1.1) builds on the graphical lasso, their 
proposed approaches build on the Dantzig selector of Candes and Tao [2] 
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and the neighborhood selection approach of Meinshausen and Biihlmann [9]. 
Several comments are in order here. 

First, we note that the extension of neighborhood selection proposed by 
Giraud and Tsybakov to deal with the low-rank component begins by refor- 
mulating the neighborhood selection procedure to obtain a "global" estima- 
tor that simultaneously estimates all the neighborhoods. This reformulation 
touches upon a fundamental aspect of latent variable modeling. In many 
applications marginalization over the latent variables typically induces cor- 
relations between most pairs of observed variables — consequently, local pro- 
cedures that learn model structure one node at a time are ill-suited for latent 
variable modeling. Stated differently, requiring that a matrix be sparse with 
few nonzeros per row or column (e.g., expressing preference for a graphical 
model with bounded degree) can be done by imposing column-wise con- 
straints. On the other hand, the constraint that a matrix be low-rank is 
really a global constraint expressed by requiring all minors of a certain size 
to vanish. Thus, any estimator for latent variable modeling (in the absence 
of additional conditions on the latent structure) must necessarily be global 
in nature. 

Second, we believe that the reformulation based on the Dantzig selector 
perhaps ought to have an additional constraint. Recall that the Dantzig se- 
lector [2] constrains the £oo norm (the dual norm of the l\ norm) of the 
correlated residuals rather than the £2 norm of the residuals as in the lasso. 
As the dual norm of our combined £\ /trace norm regularizer involves both 
an loo norm and a spectral norm, the following constraint set may be more 
appropriate in the Dantzig selector based reformulation of Giraud and Tsy- 
bakov: 

g = {(S, L) : ||Eg(S + L) - I|| /oo < 7 A„, \\T^(S + L) - I\\ 2 < K}- 

Finally, we note that the Dantzig selector of [2] has the property that 
its constraint set contains the lasso solution (with the same choice of reg- 
ularization/relaxation parameters). In contrast, this property is not shared 
in general by the Dantzig selector reformulation of Giraud and Tsybakov 
in relation to our regularized maximum-likelihood estimator (1.1). It is un- 
clear how one might achieve this property via suitable convex constraints in 
a Dantzig selector type reformulation of our estimator. 

In sum, both of these alternative estimators deserve further study. 

3. Comments on rates. Several of the commentaries (Wainwright, Gi- 
raud and Tsybakov, Ren and Zhou and Candes and Soltanolkotabi) bring 
up the possibility of improving the rates given in our paper. At the outset 
we believe that n>p samples is inherent to the latent variable modeling 
problem if spectral norm consistency is desired in the low-rank component. 
This is to be expected since the spectral norm of the deviation of a sample 
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covariance from the underlying population covariance is on the order of 



However, some more subtle issues remain. 

Giraud and Tsybakov point out that one may be concerned purely with 
estimation of the sparse component, and that the low-rank component may 
be a "nuisance" parameter. While this is not appropriate in every applica- 
tion, in problem domains where the conditional graphical model structure 
of the observed variables is the main quantity of interest one can imag- 
ine quantifying deviations in the low-rank component via "weaker" norms 
than the spectral norm — this may lead to consistent estimates for the sparse 
component with n <^.p samples. The analysis in our paper does not rule out 
this possibility, and a more careful investigation is needed to establish such 
results. 

Ren and Zhou suggest that while n>p may be required for consistent 
estimation, one may be able to weaken the assumptions on 9 and a (the min- 
imum magnitude nonzero entry of the sparse component and the minimum 
nonzero singular value of the low-rank component, respectively). From the 
literature on sparse model selection, a natural lower bound on the minimum 
magnitude nonzero entry for consistent model selection is typically given by 
the size of the noise measured in the norm (the dual of the l\ regular- 
ize^ . Building on this intuition, a natural lower bound that one can expect 
in our setting on 6 is ^||Xq — EH^, while a natural bound on a would be 

1 1 S 1 1 

\\Tjq — XI || 2 - The reason for this suggestion is that max{ , H-LH2} is the 
dual norm of the regularizer used in our paper. Therefore, it may be possible 



to only require 9 ~ an d a ~ y n - However, one issue here is that 

the £oo norm bound kicks in when n ^ logp with probability approaching 
one polynomially fast, while the spectral norm bound only kicks in when 
n>p but holds with probability approaching one exponentially fast. Thus 
(as also noted by Giraud and Tsybakov), it may be possible that n >p is re- 
quired for overall consistent estimation, but that the assumption on 9 could 
be weakened by only requiring that the probability of consistent estimation 
approach one polynomially fast. 

Candes and Soltanolkotabi comment that it would be of interest to es- 
tablish an "adaptivity" property whereby if no low-rank component were 
present, the number of samples required for consistent estimation would boil 
down to just the rate for sparse graphical model selection, that is, n ~ logp. 
While such a feature would clearly be desirable to establish for our estima- 
tor, one potential roadblock may be that our estimator (1.1) "searches" over 
a larger classes of models than just those given by sparse graphical models; 
consequently, rejecting the hypothesis that the observed variables are af- 
fected by any latent variables may require that n 3> \ogp. This question de- 
serves further investigation and, as suggested by Candes and Soltanolkotabi, 
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recent results on adaptivity could inform a more refined analysis of our es- 
timator. 

Finally, Wainwright suggests the intriguing possibility that faster rates 
may be possible if the low-rank component has additional structure. For 
example, there may exist a sparse factorization of the low-rank component 
due to special structure between the latent and observed variables. In such 
settings the trace norm regularizer applied to the low-rank component is not 
necessarily the tightest convex penalty. In recent joint work by the authors 
and Recht [4] , a general framework for constructing convex penalty functions 
based on some desired structure is presented. The trace norm penalty for 
inducing low-rank structure is motivated from the viewpoint that a low-rank 
matrix is the sum of a small number of rank-one matrices and, therefore, the 
norm induced by the convex hull of the rank-one matrices (suitably scaled) 
is a natural convex regularizer as this convex hull (the trace norm ball) 
carries precisely the kind of facial structure required for inducing low-rank 
structure in matrices. In this spirit, one can imagine constructing convex 
penalty functions by taking the convex hull of sparse rank-one matrices. 
While this convex hull is in general intractable to represent, relaxations of 
this set that are tighter than the trace norm ball could provide faster rates 
than can be obtained by using the trace norm. 

4. Weakening of irrepresentability conditions. Wainwright asks a num- 
ber of insightful questions regarding the potential for weakening our Fisher 
information based conditions. Giraud and Tsybakov also bring up connec- 
tions between our conditions and irrepresentability conditions in previous 
papers on sparse model selection [11, 16]. 

In order to better understand if the Fisher information based conditions 
stated in our paper are necessary, Wainwright raises the question of ob- 
taining a converse result by comparing to an oracle method that directly 
minimizes the rank and the cardinality of the support of the components. 
A difficulty with this approach is that we don't have a good handle on the set 
of matrices that are expressible as the sum of a sparse matrix and a low-rank 
matrix. The properties of this set remain poorly understood, and developing 
a better picture has been the focus of research efforts in algebraic geome- 
try [6] and in complexity theory [14]. Nonetheless, a comparison to oracle 
estimators that have side information about the support of the sparse com- 
ponent and the row/column spaces of the low-rank component (in effect, side 
information about the tangent spaces at the two components) appears to be 
more tractable. This is closer to the viewpoint we have taken in our paper 
in which we consider the question of identifiability of the components given 
information about the underlying tangent spaces. Essentially, our Fisher in- 
formation conditions state that these tangent spaces must be sufficiently 
transverse with respect to certain natural norms and in a space in which the 
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Fisher information is the underlying inner-product. More generally, as also 
pointed out by Giraud and Tsybakov, the necessity of Fisher information 
based conditions is an open question even in the sparse graphical model se- 
lection setting considered in [11]. The experimental studies in [11] describing 
comparisons to neighborhood selection in some simple cases provide a good 
starting point. 

Wainwright raises the broader question of consistent model selection when 
transversality of the underlying tangent spaces does not hold. One ap- 
proach [1] is to quantify the level of identifiability based on a "spikiness" 
condition. A more geometric viewpoint may be that only those pieces of the 
sparse and low-rank components that do not lie in the intersection of their 
underlying tangent spaces are fundamentally identifiable and, therefore, con- 
sistency should be quantified with respect to these identifiable pieces. 

Giraud and Tsybakov ask about the interpretability of our conditions £ (T) 
and fj,(Q). These quantities are geometric in nature and relate to the tangent 
space conditions for identifiability. In particular, they are closely related to 
(and bounded by) the incoherence of the row/column spaces of the low- 
rank component and the maximum number of nonzeros per row/column [5]. 
These latter quantities have appeared in many papers on sparse graphical 
model selection (e.g., [9, 11]) as well as on low-rank matrix completion [3], 
and computing them is straightforward. In our previous work on matrix 
decomposition [5], we note that these quantities are bounded for natural 
random families of sparse and low-rank matrices based on results in [3] . 

5. Experimental issues and applications. Lauritzen and Meinshausen as 
well as Giraud and Tsybakov raise several points about the choice of the reg- 
ularization parameters. Choosing these parameters in a data-driven manner 
(e.g., using the methods described in [10]) is clearly desirable. We do wish to 
emphasize that the sensitivities of the solution with respect to the parame- 
ters X n and 7 are qualitatively different. As described in our main theorem 
and in our experimental section, the solution of our estimator (1.1) is stable 
for a range of values of 7 (see also [5]) — this point is observed by Yuan as 
well in his experiments. Further, the choice of 7 ideally should not depend 
on n, while the choice of A n clearly should. 

On a different point regarding experimental results, Giraud and Tsybakov 
suggest at the end of their discussion that latent variable models don't seem 
to provide significantly more expressive power than a sparse graphical model. 
In contrast, Yuan's synthetic experiment seems to provide compelling evi- 
dence that our approach (1.1) provides better performance relative to models 
learned by the graphical lasso. The reason for these different observations 
may be tied to the manner in which their synthetic models were gener- 
ated. Specifically, latent variable model selection using (1.1) is likely to be 
most useful when the latent variables affect many observed variables upon 
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marginalization (e.g., latent variables are connected to many observed vari- 
ables), while the conditional graphical model among the observed variables 
conditioned on the latent variables is sparse and has bounded degree. This 
intuition is based on the theoretical analysis in our paper and is also the 
setting considered in the experiment in Yuan's discussion (as well as in the 
synthetic experiments in our paper). On the other hand, the experimental 
setup followed by Giraud and Tsybakov seems to generate a graphical model 
with large maximum degree and low average degree, and randomly selects 
a subset of the variables as latent variables. It is not clear if these latent 
variables are the ones with large degree, which may explain their remarks. 

Finally, we note that sparse and low-rank matrix decomposition is relevant 
in applications beyond the one described in our paper. As observed by Lau- 
ritzen and Meinshausen, a natural matrix decomposition problem involving 
covariance matrices may arise if one considers directed latent variable mod- 
els in the spirit of factor analysis. In such a context the covariance matrix 
may be expressed as the sum of a low-rank matrix and a sparse (rather than 
just diagonal) matrix, corresponding to the setting in which the distribution 
of the observed variables conditioned on the latent variables is given by a 
sparse covariance matrix. More broadly, similar matrix decomposition prob- 
lems arise in domains beyond statistical estimation such as optical system 
decomposition, matrix rigidity and system identification in control [5], as 
well as others as noted by Candes and Soltanolkotabi. 

6. Future questions. Our paper and the subsequent discussions raise a 
number of research and computational challenges in latent variable modeling 
that we wish to highlight briefly. 

6.1. Convex optimization in R. As mentioned by Lauritzen and Mein- 
shausen, R remains the software of choice for practitioners in statistics. How- 
ever, some of the recent advances in high-dimensional statistical estimation 
have been driven by sophisticated convex optimization based procedures 
that are typically prototyped using packages such as SDPT3 [13] and others 
in Matlab and Python. It would be of general interest to develop packages 
to invoke SDPT3 routines directly from R. 

6.2. Sparse/low-rank decomposition as infimal convolution. Given a ma- 
trix M y 0, consider the following function: 

(6.1) \\M\\ S / L „ = min7||5||^ +tr(L), s.t. M = S - L, L y 0. 

It is clear that || • Ws/L,^ is a norm, and it can be viewed as the infimal 
convolution [12] of the (scaled) i\ norm and the trace norm. In essence, 
it is a norm whose minimization induces matrices expressible as the sum of 
sparse and low-rank components (see also the atomic norm viewpoint of [4]). 



REJOINDER 



9 



We could then effectively restate (1.1) as 

M n = arg min -£(M; Eg) + A n ||M|| 5/i 

and then decompose M n by solving (6.1). This two-step approach suggests 
the possibility of decoupling the decomposition problem from the conditions 
fundamentally required for consistency via regularized maximum-likelihood, 
as the latter only ought to depend on the composite norm || • ||s/l j7 . This 
decoupling also highlights the different roles played by the parameters A n 
and 7 (as discussed in Section 5). More broadly, such an approach may be 
useful as one analyzes general regularizers, for example, convex penalties 
other than the trace norm as described in Section 3. 

6.3. Non-Gaussian latent variable modeling. As described in our paper 
and as raised by Wainwright, latent variable modeling with non-Gaussian 
variables is of interest in many applications. Both the computational and 
algebraic aspects present major challenges in this setting. Specifically, the 
secant varieties arising due to marginalization in non-Gaussian models (e.g., 
in models with categorical variables) are poorly understood, and computing 
the likelihood is also intractable. An approach based on matrix decomposi- 
tion as described in our paper may be appropriate, although one would have 
to quantify the effects of the Gaussianity assumption. 
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